This week we will be using real data collected here in Salem: the Mill Stream dataset. It comes from the continuous monitoring system operated by the City of Salem’s Watershed Council. The dataset covers two monitoring stations along Mill Stream (the same stream that flows through the Salem campus) at locations MIC3 and MIC12 (light green dots in the figure above).
The dataset has the following variables:

* DO (dissolved oxygen)
* Stream temperature
* Turbidity
Your job is to perform EDA on the dataset, similar to what we did in class. Because there is some missing data, this should include analyzing the pattern of missingness, making a suggestion of what could be causing missing data, and addressing the missingness (e.g. discard or impute using the appropriate method).
Instructions
Missingness Steps:
Evaluate the pattern of missingness - what do you find?
Based on your findings, either discard or impute the missing data. Provide a brief explanation for your choices.
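Evaluating a missingness pattern usually means asking whether the NaNs cluster by some other variable. A minimal sketch on a hypothetical mini-frame (the real analysis below runs on the full CSV):

```python
import numpy as np
import pandas as pd

# hypothetical mini-frame standing in for the real CSV
df_demo = pd.DataFrame({
    "Location": ["MIC12", "MIC12", "MIC3", "MIC3"],
    "avg_turbidity": [5.7, np.nan, 5.2, 5.0],
})

# does missingness in avg_turbidity cluster by Location?
by_loc = (df_demo.assign(miss=df_demo["avg_turbidity"].isna())
                 .groupby("Location")["miss"].sum())
print(by_loc["MIC12"], by_loc["MIC3"])  # → 1 0
```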
Setup
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("./../../datasets/salem_stream_daily_data.csv")
df.head()
```
|   | date_only  | Location | avg_temp | avg_do    | avg_turbidity | DO_delta |
|---|------------|----------|----------|-----------|---------------|----------|
| 0 | 2014-12-31 | MIC12    | 4.907500 | 12.685313 | 5.718438      | NaN      |
| 1 | 2014-12-31 | MIC3     | 4.442188 | 13.805625 | 5.217500      | NaN      |
| 2 | 2015-01-01 | MIC12    | 4.607500 | 12.745208 | 5.380521      | 0.32     |
| 3 | 2015-01-01 | MIC3     | 4.242813 | 13.842187 | 5.018438      | 0.29     |
| 4 | 2015-01-02 | MIC12    | 4.795104 | 12.704792 | 5.119063      | 0.37     |
Checking missingness
```python
total_rows = df.shape[0]

# check for missingness
def missing_check(d, field):
    missing_count = d[d[field].isna()].shape[0]
    print(f"missing in {field}: {missing_count} ({(missing_count/total_rows)*100:.2f}%)")

for col in df.columns:
    missing_check(df, col)
```
missing in date_only: 0 (0.00%)
missing in Location: 0 (0.00%)
missing in avg_temp: 0 (0.00%)
missing in avg_do: 102 (1.35%)
missing in avg_turbidity: 830 (11.00%)
missing in DO_delta: 156 (2.07%)
It looks like avg_turbidity has the most missing values at 11%; DO_delta and avg_do have far fewer (about 2% and 1.4%, respectively).
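For reference, the same per-column summary can be produced without a loop using vectorized `isna()`; a sketch on a hypothetical two-column frame:

```python
import numpy as np
import pandas as pd

# hypothetical frame; the real columns come from the CSV
demo = pd.DataFrame({
    "avg_temp": [4.9, 4.4, 4.6],
    "avg_do": [12.7, np.nan, 12.7],
})

# .isna().sum() counts NaNs per column; .isna().mean() gives the fraction
summary = demo.isna().sum().to_frame("missing")
summary["pct"] = demo.isna().mean() * 100
print(summary)
```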
Patterns?
```python
# make masks for fields with missing values
df['missing_avg_turb'] = df['avg_turbidity'].isna()
df['missing_avg_do'] = df['avg_do'].isna()
df['missing_DO_delta'] = df['DO_delta'].isna()

# are there any observations that are missing all three of these fields?
df['missing_all'] = df[['missing_avg_do', 'missing_avg_turb', 'missing_DO_delta']].all(axis='columns')
print("missing all three fields:", df[df['missing_all'] == True].shape[0])

# what about observations that are missing avg_do and do_delta?
df['missing_do'] = df[['missing_avg_do', 'missing_DO_delta']].all(axis='columns')
print("rows missing avg_do and do_delta:", df[df['missing_do'] == True].shape[0])

# is one location more prone to missingness than another?
# there are around the same number of rows for each location, so a count will work instead of a proportion
locations_missingness = df.groupby('Location')[['missing_avg_turb', 'missing_avg_do', 'missing_DO_delta']].sum()
print("\nMissingness by location:")
print(locations_missingness)

# any particular dates with more missingness?
# (df.groupby('date_only')[['missing_avg_turb', 'missing_avg_do', 'missing_DO_delta']].sum()).sort_values(
#     by=['missing_avg_turb', 'missing_avg_do', 'missing_DO_delta'], ascending=False)
# not printing that out; it's too long, and there doesn't appear to be any particular pattern by date
```
missing all three fields: 27
rows missing avg_do and do_delta: 102
Missingness by location:
missing_avg_turb missing_avg_do missing_DO_delta
Location
MIC12 603 21 47
MIC3 227 81 109
Observations:
Date doesn’t seem to be a good indicator of missingness. There’s no particular date with a lot of missing values.
Location is a better indicator of missingness.
27 rows (<1%) have turbidity, average dissolved oxygen, and dissolved oxygen delta missing. Most of these are from the MIC12 location.
102 rows (1.35%) are missing both average dissolved oxygen and DO_delta, which is the same count as the rows missing avg_do alone; this means that whenever avg_do is missing, DO_delta is missing too.
Looking at missingness by location, MIC12 has the most missingness in average turbidity, while MIC3 has the most missingness in both dissolved oxygen fields.
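The claim that a missing avg_do always implies a missing DO_delta can be checked directly with a boolean mask; a sketch using hypothetical rows:

```python
import numpy as np
import pandas as pd

# hypothetical rows illustrating the check (the real data has 102 such rows)
demo = pd.DataFrame({
    "avg_do":   [1.0, np.nan, np.nan, 2.0],
    "DO_delta": [0.1, np.nan, np.nan, np.nan],
})

# among rows where avg_do is missing, is DO_delta always missing too?
implication_holds = demo.loc[demo["avg_do"].isna(), "DO_delta"].isna().all()
print(implication_holds)  # → True
```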
What to do:
Since location is a better indicator of missingness, we can impute the values for average turbidity and average dissolved oxygen using each field's mean at each location.
I don’t know how DO_delta is calculated, so I won’t touch it for now.
```python
# note: this is completely unnecessary, but I spent so much time on it I'm leaving it in here anyway.
# A better way would be to use ffill
location_avgs = df.groupby('Location').agg(
    turbidity_means=('avg_turbidity', 'mean'),
    do_means=('avg_do', 'mean')
)
print("averages by location:")
print(location_avgs)

# replacements for turbidity
df.loc[(df['Location'] == 'MIC12') & (df['avg_turbidity'].isna()), 'avg_turbidity'] = location_avgs['turbidity_means']['MIC12']
df.loc[(df['Location'] == 'MIC3') & (df['avg_turbidity'].isna()), 'avg_turbidity'] = location_avgs['turbidity_means']['MIC3']

# replacements for avg_do
df.loc[(df['Location'] == 'MIC12') & (df['avg_do'].isna()), 'avg_do'] = location_avgs['do_means']['MIC12']
df.loc[(df['Location'] == 'MIC3') & (df['avg_do'].isna()), 'avg_do'] = location_avgs['do_means']['MIC3']
```
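The same per-location mean imputation can be done in one step with `groupby().transform('mean')` plus `fillna`, with no per-location `.loc` assignments; a sketch on a hypothetical mini-frame:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "Location": ["MIC12", "MIC12", "MIC3", "MIC3"],
    "avg_turbidity": [4.0, np.nan, 6.0, np.nan],
})

# transform('mean') broadcasts each location's mean back onto its own rows,
# so fillna replaces each NaN with the right location's average
demo["avg_turbidity"] = demo["avg_turbidity"].fillna(
    demo.groupby("Location")["avg_turbidity"].transform("mean")
)
print(demo["avg_turbidity"].tolist())  # → [4.0, 4.0, 6.0, 6.0]
```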
Use pairplot to examine general correlations and distributions
Plot variables over time to look for longitudinal trends
Plot the distributions of numeric variables to look for outliers and skewness
For the categorical variables, examine differences across categories
Tidy check and Pairplot:
```python
df_eda = df.iloc[:, 0:6].copy()  # I don't need those flag columns I made earlier; .copy() avoids a SettingWithCopyWarning

# check data types
df_eda.dtypes

# convert date to datetime
df_eda['date'] = pd.to_datetime(df_eda['date_only'])

sns.pairplot(df_eda, diag_kind='kde', hue='Location')
plt.show()
```
Plot variables over time to look for longitudinal trends:
```python
# temp over time
g = sns.lineplot(df_eda, x='date', y='avg_temp', hue='Location', alpha=0.7)
plt.show()
```

```python
# average dissolved oxygen over time
g = sns.lineplot(df_eda, x='date', y='avg_do', hue='Location', alpha=0.7)
plt.show()
```

```python
# dissolved oxygen delta over time
g = sns.lineplot(df_eda, x='date', y='DO_delta', hue='Location', alpha=0.7)
plt.show()
```

```python
# turbidity over time
g = sns.lineplot(df_eda, x='date', y='avg_turbidity', hue='Location', alpha=0.7)
plt.show()
```
Plot the distributions of numeric variables to look for outliers and skewness:
- There appears to be a negative correlation between average temperature and average dissolved oxygen: their time-series plots have roughly the same shape, but flipped. This agrees with the negative correlation shown in the pairplot.
- Average temperature looks like it's around the same for each location.
- DO delta is generally lower at MIC3 than at MIC12.
- Average turbidity has a *lot* of outliers, with most points in the range 0-20. The graph is heavily right-skewed.
- Average temperature and DO delta are also right skewed. Average dissolved oxygen looks like it's the most centered.
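Since avg_turbidity is so heavily right-skewed, a log transform (e.g. `np.log1p`) can make its distribution easier to inspect; a sketch on synthetic right-skewed data standing in for the real column:

```python
import numpy as np
import pandas as pd

# synthetic right-skewed values standing in for avg_turbidity
rng = np.random.default_rng(0)
turb = pd.Series(rng.lognormal(mean=1.5, sigma=1.0, size=500))

# log1p compresses the long right tail, pulling the skew toward zero
skew_raw = turb.skew()
skew_log = np.log1p(turb).skew()
print(skew_log < skew_raw)  # → True
```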
Extra Credit
Divide each of the numeric variables by site location, then perform the Shapiro-Wilk test for normality (scipy.stats.shapiro()) on each distribution
Use a t-test (scipy.stats.ttest_ind()) to determine if there is a statistical difference between the two sites for each of the numeric variables.
Provide a brief summary of your findings.
```python
from scipy import stats

# separating by location:
df_12 = df[df['Location'] == 'MIC12']
df_3 = df[df['Location'] == 'MIC3']

numeric_columns = ['avg_temp', 'avg_do', 'avg_turbidity', 'DO_delta']
```
Shapiro-Wilk tests
Null hypothesis: data is normally distributed
```python
def test_normal(d, col):
    result = stats.shapiro(d[col], nan_policy='omit')
    print(f"Shapiro-Wilk test for {col}:")
    print(f"\tStat = {result.statistic:.2f}\n\tP-value = {result.pvalue}")
    # check P value against alpha 0.05
    if result.pvalue < 0.05:
        print("\tReject null hypothesis, data is not normally distributed")
    else:
        print("\tDo not reject null hypothesis, data may be normally distributed")

print("Normality tests for Location MIC12")
for col in numeric_columns:
    test_normal(df_12, col)
print()  # make this more readable with an extra newline

print("Normality tests for Location MIC3")
for col in numeric_columns:
    test_normal(df_3, col)
print()
```
Normality tests for Location MIC12
Shapiro-Wilk test for avg_temp:
Stat = 0.97
P-value = 5.910326307713323e-29
Reject null hypothesis, data is not normally distributed
Shapiro-Wilk test for avg_do:
Stat = 0.99
P-value = 2.460829952075956e-16
Reject null hypothesis, data is not normally distributed
Shapiro-Wilk test for avg_turbidity:
Stat = 0.58
P-value = 6.112837407064026e-70
Reject null hypothesis, data is not normally distributed
Shapiro-Wilk test for DO_delta:
Stat = 0.93
P-value = 3.119279484762433e-39
Reject null hypothesis, data is not normally distributed
Normality tests for Location MIC3
Shapiro-Wilk test for avg_temp:
Stat = 0.96
P-value = 9.427198957141494e-30
Reject null hypothesis, data is not normally distributed
Shapiro-Wilk test for avg_do:
Stat = 0.98
P-value = 7.176432494565019e-23
Reject null hypothesis, data is not normally distributed
Shapiro-Wilk test for avg_turbidity:
Stat = 0.55
P-value = 1.6822305108100698e-71
Reject null hypothesis, data is not normally distributed
Shapiro-Wilk test for DO_delta:
Stat = 0.94
P-value = 4.929401299166423e-35
Reject null hypothesis, data is not normally distributed
T-Tests
Null hypothesis: the means of the two populations are equal
```python
def test_t(d1, d2, col):
    result = stats.ttest_ind(d1[col], d2[col], nan_policy='omit')
    print(f"\tStat = {result.statistic:.2f}\n\tP-value = {result.pvalue}")
    # check P value against alpha 0.05
    if result.pvalue < 0.05:
        print("\tReject null hypothesis, means are not equal")
    else:
        print("\tDo not reject null hypothesis, means may be equal")

for col in numeric_columns:
    print(f"T-test for {col} at MIC12 vs MIC3")
    test_t(df_12, df_3, col)
    print()
```
T-test for avg_temp at MIC12 vs MIC3
Stat = -1.75
P-value = 0.07998935280988098
Do not reject null hypothesis, means may be equal
T-test for avg_do at MIC12 vs MIC3
Stat = -9.76
P-value = 2.2133409075558774e-22
Reject null hypothesis, means are not equal
T-test for avg_turbidity at MIC12 vs MIC3
Stat = 2.31
P-value = 0.02085474692191009
Reject null hypothesis, means are not equal
T-test for DO_delta at MIC12 vs MIC3
Stat = 38.47
P-value = 3.070396703359871e-295
Reject null hypothesis, means are not equal
Findings:
According to the Shapiro-Wilk test, none of the variables is normally distributed. This agrees with the shapes of the earlier plots (none of them looked normal). It also means the t-test's normality assumption fails, so its results should be interpreted with caution.
According to the t-test, average dissolved oxygen, turbidity, and dissolved oxygen delta differ between the two locations. Average temperature is the only variable that may be the same at both sites, which is consistent with my guess from the graphs.
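Since normality fails, a nonparametric alternative such as the Mann-Whitney U test (`scipy.stats.mannwhitneyu`) compares the two sites without assuming normal distributions; a sketch on synthetic site samples:

```python
import numpy as np
from scipy import stats

# synthetic samples standing in for one variable at the two sites
rng = np.random.default_rng(42)
mic12 = rng.lognormal(mean=1.0, sigma=0.5, size=200)
mic3 = rng.lognormal(mean=1.3, sigma=0.5, size=200)

# Mann-Whitney U compares distributions by rank, so no normality is assumed
result = stats.mannwhitneyu(mic12, mic3)
print(result.pvalue < 0.05)  # → True
```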